Video processing system with human action detection and methods for use therewith

ABSTRACT

A system for processing a video signal into a processed video signal includes a pattern recognition module for detecting a region of human action in an image sequence of the video signal based on coding feedback data and generating pattern recognition data in response thereto. A video codec generates the processed video signal and generates the coding feedback data in conjunction with the processing of the image sequence.

CROSS REFERENCE TO RELATED PATENTS

The present application claims priority under 35 USC 119(e) to the provisionally filed U.S. Application entitled, VIDEO PROCESSING SYSTEM WITH PATTERN DETECTION AND METHODS FOR USE THEREWITH, having Ser. No. 61/635,034, and filed on Apr. 18, 2012, the contents of which are expressly incorporated herein in their entirety by reference thereto.

TECHNICAL FIELD OF THE INVENTION

The present invention relates to coding used in devices such as video encoders/decoders.

DESCRIPTION OF RELATED ART

Video encoding has become an important issue for modern video processing devices. Robust encoding algorithms allow video signals to be transmitted with reduced bandwidth and stored in less memory. However, the accuracy of these encoding methods faces the scrutiny of users who are becoming accustomed to higher resolution and better picture quality. Standards have been promulgated for many encoding methods, including the H.264 standard that is also referred to as MPEG-4 Part 10, or Advanced Video Coding (AVC). While this standard sets forth many powerful techniques, further improvements are possible to improve the performance and speed of implementation of such methods. Video encoding can be a computationally complex task for high-resolution video signals, requiring large amounts of computation and storage.

Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of ordinary skill in the art through comparison of such systems with the present invention.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention.

FIG. 2 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention.

FIG. 3 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention.

FIG. 4 presents a block diagram representation of a pattern recognition module 125 in accordance with a further embodiment of the present invention.

FIG. 5 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention.

FIG. 6 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention.

FIG. 7 presents a block diagram representation of a post-processing module 160 in accordance with a further embodiment of the present invention.

FIG. 8 presents a tabular representation of a searchable index 162 in accordance with a further embodiment of the present invention.

FIG. 9 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention.

FIG. 10 presents a block diagram representation of a pattern recognition module 125′ in accordance with a further embodiment of the present invention.

FIG. 11 presents a block diagram representation of a pattern detection module 175 or 175′ in accordance with a further embodiment of the present invention.

FIG. 12 presents a pictorial representation of an image 370 in accordance with a further embodiment of the present invention.

FIG. 13 presents a block diagram representation of a supplemental pattern recognition module 360 in accordance with an embodiment of the present invention.

FIG. 14 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention.

FIG. 15 presents a block diagram representation of a candidate region detection module 320 in accordance with a further embodiment of the present invention.

FIG. 16 presents a pictorial representation of an image 380 in accordance with a further embodiment of the present invention.

FIGS. 17-19 present pictorial representations of images 390, 392 and 395 in accordance with a further embodiment of the present invention.

FIG. 20 presents a block diagram representation of a video distribution system 75 in accordance with an embodiment of the present invention.

FIG. 21 presents a block diagram representation of a video storage system 79 in accordance with an embodiment of the present invention.

FIG. 22 presents a block diagram representation of a video server 80 in accordance with an embodiment of the present invention.

FIG. 23 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 24 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 25 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 26 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 27 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 28 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 29 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 30 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 31 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 32 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 33 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 34 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 35 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 36 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

FIG. 37 presents a flowchart representation of a method in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION INCLUDING THE PRESENTLY PREFERRED EMBODIMENTS

FIG. 1 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention. In particular, a video processing system 102 includes both a video codec 103 and a pattern recognition module 125. Video encoding/decoding and pattern recognition are both computationally complex tasks, especially when performed on high-resolution videos. Some temporal and spatial information, such as motion vectors, statistical information of blocks and shot segmentation, is useful for both tasks. If the two tasks are developed together, they can share this information and economize on the effort needed to implement them.

In particular, a video system includes a receiving module 100, such as a video server, set-top box, television receiver, personal computer, cable television receiver, satellite broadcast receiver, broadband modem, 3G transceiver, network node, cable headend or other information receiver or transceiver that is capable of receiving one or more video signals 110 from one or more sources such as video content providers, a broadcast cable system, a broadcast satellite system, the Internet, a digital video disc player, a digital video recorder, or other video source. Video processing system 102 is coupled to the receiving module 100 to encode, decode and/or transcode one or more of the video signals 110 to form processed video signal 112 via the operation of video codec 103.

In an embodiment of the present invention, the video signals 110 can include a broadcast video signal, such as a television signal, high definition television signal, enhanced high definition television signal or other broadcast video signal that has been transmitted over a wireless medium, either directly or through one or more satellites or other relay stations or through a cable network, optical network or other transmission network. In addition, the video signals 110 can be generated from a stored video file, played back from a recording medium such as a magnetic tape, magnetic disk or optical disk, and can include a streaming video signal that is transmitted over a public or private network such as a local area network, wide area network, metropolitan area network or the Internet.

Video signal 110 and processed video signal 112 can each be differing ones of an analog audio/video (A/V) signal that is formatted in any of a number of analog video formats including National Television Systems Committee (NTSC), Phase Alternating Line (PAL) or Sequentiel Couleur Avec Memoire (SECAM). The video signal 110 and/or processed video signal 112 can each be a digital audio/video signal in an uncompressed digital audio/video format such as high-definition multimedia interface (HDMI) formatted data, International Telecommunications Union recommendation BT.656 formatted data, inter-integrated circuit sound (I2S) formatted data, and/or other digital A/V data formats.

The video signal 110 and/or processed video signal 112 can each be a digital video signal in a compressed digital video format such as H.264, MPEG-4 Part 10 Advanced Video Coding (AVC) or other digital format such as a Moving Picture Experts Group (MPEG) format (such as MPEG1, MPEG2 or MPEG4), Quicktime format, Real Media format, Windows Media Video (WMV) or Audio Video Interleave (AVI), or another digital video format, either standard or proprietary. When video signal 110 is received as digital video and/or processed video signal 112 is produced in a digital video format, the digital video signal may be optionally encrypted, may include corresponding audio and may be formatted for transport via one or more container formats.

Examples of such container formats are encrypted Internet Protocol (IP) packets such as used in IP TV, Digital Transmission Content Protection (DTCP), etc. In this case the payload of an IP packet contains several transport stream (TS) packets and the entire payload of the IP packet is encrypted. Other examples of container formats include encrypted TS streams used in Satellite/Cable Broadcast, etc. In these cases, the payload of a TS packet contains packetized elementary stream (PES) packets. Further, digital video discs (DVDs) and Blu-ray Discs (BDs) utilize PES streams where the payload of each PES packet is encrypted.

In operation, video codec 103 encodes, decodes or transcodes the video signal 110 into a processed video signal 112. The pattern recognition module 125 operates cooperatively with the video codec 103, in parallel or in tandem, based on feedback data from the video codec 103 generated in conjunction with the encoding, decoding or transcoding of the video signal 110. The pattern recognition module 125 processes image sequences in the video signal 110 to detect patterns of interest. When one or more patterns of interest are detected, the pattern recognition module 125 generates pattern recognition data 156 in response that indicates the pattern or patterns of interest. The pattern recognition data can take the form of data that identifies patterns and corresponding features, like color, shape, size information, number and motion; the recognition of objects or features; as well as the location of these patterns or features in regions of particular images of an image sequence and the particular images in the sequence that contain these particular objects or features.
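
By way of illustration only, the record below is a minimal Python sketch of one way such pattern recognition data 156 could be organized; all field names are hypothetical, since the specification describes the content of the data (patterns, features, regions and the images that contain them) but not a concrete schema.

    # Hypothetical record layout for pattern recognition data 156.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class PatternRecord:
        pattern_id: str                            # e.g. "face", "text", "human_action"
        color: Tuple[int, int, int] = (0, 0, 0)    # dominant color of the pattern
        shape: str = ""                            # coarse shape descriptor
        size: Tuple[int, int] = (0, 0)             # bounding-box width/height in pixels
        count: int = 1                             # number of instances detected
        motion: Tuple[float, float] = (0.0, 0.0)   # mean motion vector (dx, dy)
        # Regions as (frame_index, x, y, w, h): which images contain the
        # pattern and where it appears within each image.
        regions: List[Tuple[int, int, int, int, int]] = field(default_factory=list)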

The feedback generated by the video codec 103 can take on many different forms. For example, while temporal and spatial information is used by video codec 103 to remove redundancy, this information can also be used by pattern recognition module 125 to detect or recognize features like sky, grass, sea, wall, buildings and building features such as the type of building, the number of building stories, etc., moving vehicles and animals (including people). Temporal feedback in the form of motion vectors estimated in encoding or retrieved in decoding (or motion information obtained by optical flow for very low resolutions) can be used by pattern recognition module 125 for motion-based pattern partition or recognition via a variety of moving group algorithms. In addition, temporal information can be used by pattern recognition module 125 to improve recognition by temporal noise filtering, by providing multiple candidate pictures from which the best image in an image sequence can be selected for recognition, and by enabling recognition of temporal features over a sequence of images. Spatial information, such as statistical information like variance, frequency components and bit consumption estimated from input YUV or retrieved from input streams, can be used for texture-based pattern partition and recognition by a variety of different classifiers. Further recognition features, like structure, texture, color and motion characteristics, can be used for precise pattern partition and recognition. For instance, line structures can be used to identify and characterize manmade objects such as buildings and vehicles. Random motion, rigid motion and relative position motion are effective to discriminate water, vehicles and animals, respectively. Shot transition information from encoding or decoding that identifies transitions between video shots in an image sequence can be used to start new pattern detection and recognition and to provide points of demarcation for temporal recognition across a plurality of images.
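
As an illustration of motion-based pattern partition, the following minimal Python sketch groups per-block motion vectors into moving groups with a small k-means loop. The specification refers to moving group algorithms only generically; the clustering method, the group count k and the use of NumPy are assumptions made here for illustration.

    import numpy as np

    def group_motion_vectors(mvs, k=3, iters=20):
        """mvs: (N, 2) array of per-block motion vectors (dx, dy).
        Returns a group label per block and the group centers."""
        mvs = np.asarray(mvs, dtype=float)
        rng = np.random.default_rng(0)
        centers = mvs[rng.choice(len(mvs), size=k, replace=False)]
        labels = np.zeros(len(mvs), dtype=int)
        for _ in range(iters):
            # Assign each motion vector to the nearest group center.
            d = np.linalg.norm(mvs[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Move each center to the mean of its assigned vectors.
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = mvs[labels == j].mean(axis=0)
        return labels, centers

Blocks sharing a label then form one candidate moving region (e.g. a vehicle against a static background).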

In addition, feedback from the pattern recognition module 125 can be used to guide the encoding or transcoding performed by video codec 103. After pattern recognition, more specific structural and statistical information can be retrieved that can guide mode decision and rate control to improve quality and performance in encoding or transcoding of the video signal 110. Pattern recognition can also generate feedback that identifies regions with different characteristics. After pattern recognition, estimated motion vectors can be grouped and processed in accordance with this feedback; these more contextually correct and grouped motion vectors can improve quality and save bits in encoding, especially in low bit rate cases. In particular, pattern recognition feedback can be used by video codec 103 for bit allocation in different regions of an image or image sequence in encoding or transcoding of the video signal 110. With pattern recognition and the codec running together, each can provide powerful aids to the other.

The video processing system 102 can be implemented in conjunction with many optional functions and features described in conjunction with FIGS. 2-37 that follow.

FIG. 2 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention. In particular, video processing system 102 includes a video codec 103 having decoder section 240 and encoder section 236 that operate in accordance with many of the functions and features of the H.264 standard, the MPEG-4 standard, VC-1 (SMPTE standard 421M) or another standard, to decode, encode, transrate or transcode video input signals 110 that are received via a signal interface 198 to generate the processed video signal 112. In conjunction with the encoding, decoding and/or transcoding of the video signal 110, the video codec 103 generates or retrieves the decoded image sequence of the content of video signal 110 along with coding feedback for transfer to the pattern recognition module 125. The pattern recognition module 125 operates based on an image sequence to generate pattern recognition data 156 and pattern recognition feedback for transfer back to the video codec 103. In particular, pattern recognition module 125 can operate via clustering, statistical pattern recognition, syntactic pattern recognition or via other pattern detection algorithms or methodologies to detect a pattern of interest in an image or image sequence (frame or field) of video signal 110 and generate pattern recognition data 156 in response thereto.

The processing module 230 can be implemented using a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, co-processors, a micro-controller, digital signal processor, microcomputer, central processing unit, field programmable gate array, programmable logic device, state machine, logic circuitry, analog circuitry, digital circuitry, and/or any device that manipulates signals (analog and/or digital) based on operational instructions that are stored in a memory, such as memory module 232. Memory module 232 may be a single memory device or a plurality of memory devices. Such a memory device can include a hard disk drive or other disk drive, read-only memory, random access memory, volatile memory, non-volatile memory, static memory, dynamic memory, flash memory, cache memory, and/or any device that stores digital information. Note that when the processing module implements one or more of its functions via a state machine, analog circuitry, digital circuitry, and/or logic circuitry, the memory storing the corresponding operational instructions may be embedded within, or external to, the circuitry comprising the state machine, analog circuitry, digital circuitry, and/or logic circuitry.

Processing module 230 and memory module 232 are coupled, via bus 250, to the signal interface 198 and a plurality of other modules, such as pattern recognition module 125, decoder section 240 and encoder section 236. In an embodiment of the present invention, the signal interface 198, video codec 103 and pattern recognition module 125 each operate in conjunction with the processing module 230 and memory module 232. The modules of video processing system 102 can each be implemented in software, firmware or hardware, depending on the particular implementation of processing module 230. It should also be noted that the software implementations of the present invention can be stored on a tangible storage medium such as a magnetic or optical disk, read-only memory or random access memory and can also be produced as an article of manufacture. While a particular bus architecture is shown, alternative architectures using direct connectivity between one or more modules and/or additional busses can likewise be implemented in accordance with the present invention.

FIG. 3 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention. As previously discussed, the video codec 103 generates the processed video signal 112 based on the video signal, retrieves or generates image sequence 310 and further generates coding feedback data 300. While the coding feedback data 300 can include other temporal or spatial encoding information, the coding feedback data 300 includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots that each include a plurality of images in the image sequence 310.

The pattern recognition module 125 includes a shot segmentation module 150 that segments the image sequence 310 into shot data 154 corresponding to the plurality of shots, based on the coding feedback data 300. A pattern detection module 175 analyzes the shot data 154 and generates pattern recognition data 156 that identifies at least one pattern of interest in conjunction with at least one of the plurality of shots.

In an embodiment, the shot segmentation module 150 operates based on coding feedback 300 that includes shot transition data 152 generated, for example, from preprocessing information, like variance and downscaled motion cost, in encoding, and from reference and bit consumption information in decoding. Shot transition data 152 can not only be included in coding feedback 300, but can also be generated by video codec 103 for use in GOP structure decisions, mode selection and rate control to improve quality and performance in encoding.

For example, encoding preprocessing information, like variance and downscaled motion cost, can be used for shot segmentation. Based on their historical tracks, if variance and downscaled motion cost change dramatically, an abrupt shot transition happens; when variances keep changing monotonically and motion costs jump up and down at the start and end points of the monotonic variance changes, there is a gradual shot transition, like a fade-in, fade-out, dissolve, or wipe. In decoding, frame reference information and bit consumption can be used similarly. The output shot transition data 152 can be used not only for GOP structure decisions, mode selection and rate control to improve quality and performance in encoding, but also for temporal segmentation of the image sequence 310 and as an enabler for frame-rate-invariant shot-level search features.
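
The following minimal Python sketch illustrates one way abrupt transitions might be flagged from the historical tracks of variance and downscaled motion cost; the one-frame window and the jump factors are illustrative assumptions, since the specification describes these heuristics only qualitatively.

    import numpy as np

    def detect_abrupt_transitions(variance, motion_cost,
                                  var_jump=2.0, cost_jump=2.0):
        """variance, motion_cost: per-frame arrays of encoder statistics.
        Flags frame i as an abrupt shot transition when both statistics
        change dramatically relative to the previous frame."""
        variance = np.asarray(variance, dtype=float)
        motion_cost = np.asarray(motion_cost, dtype=float)
        cuts = []
        for i in range(1, len(variance)):
            v_ratio = variance[i] / max(variance[i - 1], 1e-6)
            c_ratio = motion_cost[i] / max(motion_cost[i - 1], 1e-6)
            # "Change dramatically": either statistic jumps up or down
            # by more than the given factor.
            if (max(v_ratio, 1 / v_ratio) > var_jump and
                    max(c_ratio, 1 / c_ratio) > cost_jump):
                cuts.append(i)
        return cuts

A gradual-transition detector would instead look for a monotonic variance track bounded by jumps in motion cost at its start and end points.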

Index data 115 can include a text string that identifies a pattern of interest for use in video storage and retrieval, and particularly to find videos of interest (e.g. relating to sports or cooking), locate videos containing certain scenes (e.g. a man and a woman on a beach), certain subject matter (e.g. regarding the American Civil War), certain venues (e.g. the Eiffel Tower), certain objects (e.g. a Patek Philippe watch), certain themes (e.g. romance, action, horror), etc. Video indexing can be subdivided into five steps: modeling based on domain-specific attributes, segmentation, extraction, representation, and organization. Some functions used in encoding, like shot (temporally and visually connected frames) and scene (temporally and contextually connected shots) segmentation, can likewise be used in visual indexing.

In operation, the pattern detection module 175 operates via clustering, statistical pattern recognition, syntactic pattern recognition or via other pattern detection algorithms or methodologies to detect a pattern of interest in an image or image sequence 310 and generates pattern recognition data 156 in response thereto. In this fashion, objects/features in each shot can be correlated to the shots that contain them, enabling indexing and search of the indexed video for key objects/features and the shots that contain them. The indexing data 115 can be used for scene segmentation in a server, set-top box or other video processing system based on the extracted information and algorithms such as a hidden Markov model (HMM) algorithm that is based on a priori field knowledge.

Consider an example where video signal 110 contains a video broadcast. Index data 115 indicating that anchor shots and field shots are shown alternately could indicate a news broadcast; crowd shots and sports shots shown alternately could indicate a sporting event. Scene information can also be used for rate control, like quantization parameter (QP) initialization at shot transitions in encoding. Index data 115 can be used to generate more high-level motive and contextual descriptions via manual review by human personnel. For instance, based on the results mentioned above, operators could process index data 115 to provide additional descriptors for an image sequence 310 to, for example, describe an image sequence as “around 10 people (Adam, Brian, . . . ) watching a live Elton John show on grass under the sky in the Queen's Park.”
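
As a toy illustration of such scene-level inference, the Python sketch below maps alternating shot labels to a scene category; a trained HMM, as suggested above, would replace this heuristic in practice, and the label vocabulary is hypothetical.

    def infer_scene(shot_labels):
        """shot_labels: list of shot-level labels in temporal order.
        Returns a coarse scene category from alternating label pairs."""
        pairs = set(zip(shot_labels, shot_labels[1:]))
        if ("anchor", "field") in pairs and ("field", "anchor") in pairs:
            return "news broadcast"
        if ("crowd", "game") in pairs and ("game", "crowd") in pairs:
            return "sporting event"
        return "unknown"

    # Alternating crowd and game shots suggest a sporting event.
    print(infer_scene(["crowd", "game", "crowd", "game"]))  # sporting event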

The indexing data 115 can contain pattern recognition data 156 and other hierarchical indexing information such as: frame-level temporal and spatial information, including variance, global motion, bit number, etc.; shot-level objects and text strings or other descriptions of features, such as text regions of a video, human and action descriptions, object information and background texture descriptions, etc.; scene-level representations, such as video category (news cast, sitcom, commercials, movie, sports or documentary, etc.); and high-level context-level descriptions and presentations presented as text strings, numerical classifiers or other data descriptors.

In addition, pattern recognition feedback 298, in the form of pattern recognition data 156 or other feedback from the pattern recognition module 125, can be used to guide the encoding or transcoding performed by video codec 103. After pattern recognition, more specific structural and statistical information can be generated as pattern recognition feedback 298 that can, for instance, guide mode decision and rate control to improve quality and performance in encoding or transcoding of the video signal 110. Pattern recognition module 125 can also generate pattern recognition feedback 298 that identifies regions with different characteristics. After pattern recognition, estimated motion vectors can be grouped and processed in accordance with the pattern recognition feedback 298; these more contextually correct and grouped motion vectors can improve quality and save bits in encoding, especially in low bit rate cases. In particular, the pattern recognition feedback 298 can be used by video codec 103 for bit allocation in different regions of an image or image sequence in encoding or transcoding of the video signal 110.

FIG. 4 presents a block diagram representation of a pattern recognition module 125 in accordance with a further embodiment of the present invention. As shown, the pattern recognition module 125 includes a shot segmentation module 150 that segments an image sequence 310 into shot data 154 corresponding to a plurality of shots, based on the coding feedback data 300, such as shot transition data 152. The pattern detection module 175 analyzes the shot data 154 and generates pattern recognition data 156 that identifies at least one pattern of interest in conjunction with at least one of the plurality of shots.

The coding feedback data 300 can be generated by video codec 103 in conjunction with either a decoding of the video signal 110, an encoding of the video signal 110 or a transcoding of the video signal 110. The video codec 103 can generate the shot transition data 152 based on image statistics, group of picture data, etc. As discussed above, encoding preprocessing information, like variance and downscaled motion cost, can be used to generate shot transition data 152 for shot segmentation. Based on their historical tracks, if variance and downscaled motion cost change dramatically, an abrupt shot transition happens; when variances keep changing monotonically and motion costs jump up and down at the start and end points of the monotonic variance changes, there is a gradual shot transition, like a fade-in, fade-out, dissolve, or wipe. In decoding, frame reference information and bit consumption can be used similarly. The output shot transition data 152 can be used not only for GOP structure decisions, mode selection and rate control to improve quality and performance in encoding, but also for temporal segmentation of the image sequence 310 and as an enabler for frame-rate-invariant shot-level search features.

Further coding feedback data 300 can also be used by pattern detection module 175. The coding feedback data can include one or more image statistics, and the pattern detection module 175 can generate the pattern recognition data 156 based on these image statistics to identify features such as faces, text and human actions, as well as other objects and features. As discussed in conjunction with FIG. 1, temporal and spatial information used by video codec 103 to remove redundancy can also be used by pattern detection module 175 to detect or recognize features like sky, grass, sea, wall, buildings, moving vehicles and animals (including people). Temporal feedback in the form of motion vectors estimated in encoding or retrieved in decoding (or motion information obtained by optical flow for very low resolutions) can be used by pattern detection module 175 for motion-based pattern partition or recognition via a variety of moving group algorithms. Spatial information, such as statistical information like variance, frequency components and bit consumption estimated from input YUV or retrieved from input streams, can be used for texture-based pattern partition and recognition by a variety of different classifiers. Further recognition features, like structure, texture, color and motion characteristics, can be used for precise pattern partition and recognition. For instance, line structures can be used to identify and characterize manmade objects such as buildings and vehicles. Random motion, rigid motion and relative position motion are effective to discriminate water, vehicles and animals, respectively.

In addition to analysis of static images included in the shot data 154, the shot data 154 can include a plurality of images in the image sequence 310, and the pattern detection module 175 can generate the pattern recognition data 156 based on a temporal recognition performed over a plurality of images within a shot. Slight motion within a shot and aggregation of images over a plurality of shots can enhance the resolution of the images for pattern analysis and can provide three-dimensional data from differing perspectives for the analysis and recognition of three-dimensional objects; other motion can aid in recognizing objects and other features based on the motion that is detected.

Pattern detection module 175 generates the pattern recognition feedback 298 as described in conjunction with FIG. 3, or other pattern recognition feedback that can be used by the video codec 103 in conjunction with the processing of video signal 110 into processed video signal 112. The operation of the pattern detection module 175 can be described in conjunction with the following additional examples.

In an example of operation, the video processing system 102 is part of a web server, teleconferencing system, security system or set-top box that generates indexing data 115 with facial recognition. The pattern detection module 175 operates based on coding feedback 300 that includes motion vectors estimated in encoding or retrieved in decoding (or motion information obtained by optical flow, etc., for very low resolutions), together with a skin color model used to roughly partition face candidates. The pattern detection module 175 tracks a candidate facial region over the plurality of images and detects a face in the image based on one or more of these images. Shot transition data 152 in coding feedback 300 can be used to start a new series of face detecting and tracking.

For example, pattern detection module 175 can operate via detection of colors in image sequence 310. The pattern detection module 175 generates a color bias corrected image from image sequence 310 and a color transformed image from the color bias corrected image. Pattern detection module 175 then operates to detect colors in the color transformed image that correspond to skin tones. In particular, pattern detection module 175 can operate using an elliptic skin model in the transformed space, such as the CbCr subspace of a transformed YCbCr space. In particular, a parametric ellipse corresponding to contours of constant Mahalanobis distance can be constructed under the assumption of a Gaussian skin tone distribution to identify a detected region 322 based on a two-dimensional projection in the CbCr subspace. As exemplars, the 853,571 pixels corresponding to skin patches from the Heinrich-Hertz-Institute image database can be used for this purpose; however, other exemplars can likewise be used within the broader scope of the present invention.
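
A minimal Python sketch of such an elliptic skin model follows; the mean and covariance below are illustrative stand-ins rather than values actually fitted to the Heinrich-Hertz-Institute skin patches, and the distance threshold is an assumption.

    import numpy as np

    SKIN_MEAN = np.array([110.0, 150.0])                 # assumed (Cb, Cr) mean
    SKIN_COV_INV = np.linalg.inv(np.array([[80.0, 10.0],
                                           [10.0, 60.0]]))  # assumed covariance

    def skin_mask(cb, cr, max_dist=2.5):
        """cb, cr: HxW chroma planes. Returns a boolean skin mask: pixels
        inside the constant-Mahalanobis-distance ellipse are marked skin."""
        d = np.stack([cb, cr], axis=-1).astype(float) - SKIN_MEAN
        # Squared Mahalanobis distance per pixel; the parametric ellipse
        # is a contour of this distance in the CbCr subspace.
        m2 = np.einsum('...i,ij,...j->...', d, SKIN_COV_INV, d)
        return m2 < max_dist ** 2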

In an embodiment, the pattern detection module 175 tracks a candidate facial region over the plurality of images and detects a facial region based on an identification of facial motion in the candidate facial region over the plurality of images, wherein the facial motion includes at least one of eye movement and mouth movement. In particular, face candidates can be validated for face detection based on the further recognition by pattern detection module 175 of facial features, like eye blinking (both eyes blink together, which discriminates face motion from other motion; the eyes are symmetrically positioned with a fixed separation, which provides a means to normalize the size and orientation of the head), and the shape, size, motion and relative position of the face, eyebrows, eyes, nose, mouth, cheekbones and jaw. Any of these facial features can be extracted from the shot data 154 and used by pattern detection module 175 to eliminate false detections. Further, the pattern detection module 175 can employ temporal recognition to extract three-dimensional features based on different facial perspectives included in the plurality of images to improve the accuracy of the recognition of the face. Using temporal information, the problems of face detection, including poor lighting, partial covering, and size and posture sensitivity, can be partly solved based on such facial tracking. Furthermore, based on profile views from a range of viewing angles, more accurate and three-dimensional features such as the contours of the eye sockets, nose and chin can be extracted.

In addition to generating pattern recognition data 156 for indexing, the pattern recognition data 156 that indicates a face has been detected and the location of the facial region can also be used as pattern recognition feedback 298. The pattern recognition data 156 can include facial characteristic data such as position in the stream; the shape, size and relative position of the face, eyebrows, eyes, nose, mouth, cheekbones and jaw; skin texture and visual details of the skin (lines, patterns, and spots apparent in a person's skin); or even enhanced, normalized and compressed face images. In response, the encoder section 236 can guide the encoding of the image sequence based on the location of the facial region. In addition, pattern recognition feedback 298 that includes facial information can be used to guide mode selection and bit allocation during encoding. Further, the pattern recognition data 156 and pattern recognition feedback 298 can further indicate the location of eyes or mouth in the facial region for use by the encoder section 236 to allocate greater resolution to these important facial features. For example, in very low bit rate cases the encoder section 236 can avoid the use of inter-mode coding in the region around blinking eyes and/or a talking mouth, allocating more encoding bits to these face areas.

In a further example of operation, the video processing system 102 is part of a web server, teleconferencing system, security system or set-top box that generates indexing data 115 with text recognition. In this fashion, text data such as automobile license plate numbers, store signs, building names, subtitles, name tags, and other text portions in the image sequence 310 can be detected and recognized. Text regions typically have obvious features that can aid detection and recognition: these regions have relatively high frequency; they usually have high contrast in a regular shape; they are usually aligned and spaced equally; and they tend to move with the background or with objects.

Coding feedback 300 can be used by the pattern detection module 175 to aid in detection. For example, shot transition data from encoding or decoding can be used to start a new series of text detecting and tracking. Statistical information, like variance, frequency components and bit consumption, estimated from input YUV or retrieved from input streams, can be used for text partitioning. Edge detection, YUV projection, alignment and spacing information, etc. can also be used to further partition text regions of interest. Coding feedback data 300 in the form of motion vectors can be retrieved for the identified text regions in motion compensation. Then reliable structural features, like lines, ends, singular points, shape and connectivity, can be extracted.
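
As a simple illustration of such statistics-driven text partitioning, the Python sketch below keeps image blocks whose luma variance exceeds a threshold as text candidates; the block size and threshold are assumptions, and a real implementation would add the edge, alignment and spacing tests described above.

    import numpy as np

    def text_candidate_blocks(luma, block=16, var_thresh=400.0):
        """luma: HxW luminance plane. Returns (row, col) block coordinates
        of high-variance blocks, which are candidate text regions."""
        h, w = luma.shape
        candidates = []
        for r in range(0, h - block + 1, block):
            for c in range(0, w - block + 1, block):
                patch = luma[r:r + block, c:c + block].astype(float)
                # Text regions tend to be high-frequency and high-contrast,
                # which shows up as high local variance.
                if patch.var() > var_thresh:
                    candidates.append((r // block, c // block))
        return candidates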

In this mode of operation, the pattern detection module 175 generates pattern recognition data 156 that can include an indication that text was detected, a location of the region of text and indexing data 115 that correlates the region of text to a corresponding video shot. The pattern detection module 175 can further operate to generate a text string by recognizing the text in the region of text and further to generate index data 115 that includes the text string correlated to the corresponding video shot. The pattern detection module 175 can operate via a trained hierarchical and fuzzy classifier, neural network and/or vector processing engine to recognize text in a text region and to generate candidate text strings. These candidate text strings may optionally be modified later into final text by post processing or further offline analysis and processing of the shot data.

The pattern recognition data 156 can be included in pattern recognition feedback 298 and used by the encoder section 236 to guide the encoding of the image sequence. In this fashion, text region information can guide mode selection and rate control. For instance, small partition modes can be avoided in a small text region; motion vectors can be grouped around text; and high quantization steps can be avoided in text regions, even in very low bit rate cases, to maintain adequate reproduction of the text.
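
The Python sketch below illustrates this kind of rate control guidance: planned quantization parameters are capped inside reported text regions so the text stays legible. The macroblock size, the QP cap and the map representation are assumptions made for illustration only.

    def adjust_qp(qp_map, text_regions, mb_size=16, qp_cap=28):
        """qp_map: dict mapping (mb_row, mb_col) -> planned QP.
        text_regions: list of (x, y, w, h) pixel rectangles reported in
        pattern recognition feedback 298."""
        for (x, y, w, h) in text_regions:
            for mb_row in range(y // mb_size, (y + h) // mb_size + 1):
                for mb_col in range(x // mb_size, (x + w) // mb_size + 1):
                    key = (mb_row, mb_col)
                    if key in qp_map:
                        # Avoid high quantization steps in text regions.
                        qp_map[key] = min(qp_map[key], qp_cap)
        return qp_map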

In another example of operation, the video processing system 102 is part of a web server, teleconferencing system, security system or set-top box that generates indexing data 115 with recognition of human action. In this fashion, a region of human action can be determined along with human action descriptions such as the number of people, body sizes and features, pose types, position and velocity; actions such as kick, throw, catch, run, walk, fall down, loiter, drop an item, etc. can be detected and recognized.

Coding feedback 300 can be used by the pattern detection module 175 to aid in detection. For example, shot transition data from encoding or decoding can be used to start a new series of action detecting and tracking. Motion vectors from encoding or decoding (or motion information obtained by optical flow, etc., for very low resolutions) can be employed for this purpose.

In this mode of operation, the pattern detection module 175 generates pattern recognition data 156 that can include an indication that a human was detected, a location of the region of the human and indexing data 115 that includes, for example, human action descriptors and correlates the human action to a corresponding video shot. The pattern detection module 175 can subdivide the process of human action recognition into: moving object detection, human discrimination, tracking, and action understanding and recognition. In particular, the pattern detection module 175 can identify a plurality of moving objects in the plurality of images. For example, moving objects can be partitioned from the background. The pattern detection module 175 can then discriminate one or more humans from the plurality of moving objects. Human motion can be non-rigid and periodic. Shape-based features, including the color and shape of the face and head, width-height ratio, limb positions and areas, tilt angle of the human body, distance between feet, projection and contour characteristics, etc., can be employed to aid in this discrimination. These shape, color and/or motion features can be recognized as corresponding to human action via a classifier such as a neural network. The action of the human can be tracked over the images in a shot and a particular type of human action can be recognized in the plurality of images. Individuals, represented as groups of corners and edges, etc., can be precisely tracked using algorithms such as model-based and active contour-based algorithms. Gross motion information can be obtained via a Kalman filter or other filtering techniques. Based on the tracking information, action recognition can be implemented by hidden Markov models, dynamic Bayesian networks, syntactic approaches or other pattern recognition algorithms.
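
As an illustration of the tracking step, the Python sketch below smooths a tracked centroid with a constant-velocity Kalman filter; the state model and noise levels are assumptions, and an action classifier (e.g. an HMM) would consume the resulting track.

    import numpy as np

    def kalman_track(centroids, q=1e-2, r=1.0):
        """centroids: list of observed (x, y) positions per frame.
        Returns Kalman-smoothed positions for gross motion information."""
        F = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                      [0, 0, 1, 0], [0, 0, 0, 1]], float)  # constant velocity
        H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)  # observe position
        x = np.array([*centroids[0], 0.0, 0.0])
        P = np.eye(4)
        out = []
        for z in centroids:
            x = F @ x                          # predict state
            P = F @ P @ F.T + q * np.eye(4)    # predict covariance
            S = H @ P @ H.T + r * np.eye(2)    # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
            x = x + K @ (np.asarray(z, float) - H @ x)  # update with measurement
            P = (np.eye(4) - K @ H) @ P
            out.append(tuple(x[:2]))
        return out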

The pattern recognition data 156 can be included in pattern recognition feedback 298 and used by the encoder section 236 to guide the encoding of the image sequence. In this fashion, the presence and location of human action can guide mode selection and rate control. For instance, inside a shot, motion prediction information, trajectory analysis or other human action descriptors generated by pattern detection module 175 and output as pattern recognition feedback 298 can assist the video codec 103 in motion estimation in encoding.

FIG. 5 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention. In the example presented, a video signal 110 includes an image sequence 310 of a sporting event such as a football game that is processed by shot segmentation module 150 into shot data 154. Coding feedback data 300 from the video codec 103 includes shot transition data that indicates which images in the image sequence fall within which of the four shots that are shown. The first shot in the temporal sequence is a commentator shot, the second and fourth shots are shots of the game and the third shot is a shot of the crowd.

FIG. 6 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention. Following the example of FIG. 5, the pattern detection module 175 analyzes the shot data 154 in the four shots, based on the images included in each of the shots as well as temporal and spatial coding feedback data 300 from video codec 103, to recognize the first shot as being a commentator shot, the second and fourth shots as being shots of the game and the third shot as being a shot of the crowd.

The pattern detection module 175 generates pattern recognition data 156 in conjunction with each of the shots that identifies the first shot as being a commentator shot, the second and fourth shots as being shots of the game and the third shot as being a shot of the crowd. The pattern recognition data 156 is correlated to the shot transition data 152 to identify the location of each shot in the image sequence 310 and to identify each shot with the corresponding pattern recognition data 156, and optionally to identify a region within the shot, by image and/or within one or more images, that includes the identified subject matter.

FIG. 7 presents a block diagram representation of a post-processing module 160 in accordance with a further embodiment of the present invention. In particular, a post processing module 160 is presented that further processes the indexing data 115 to generate a searchable index 162. In an embodiment, post processing module 160 generates the searchable index 162 from the index data 115 by correlating common content from the plurality of shots. Considering the example presented in conjunction with FIG. 6, the second and fourth shots would be matched together as both being game shots and placed in a hierarchical structure under a common label, while preserving the range of images corresponding to each of the shots.

In a further embodiment, the post processing module 160 could process the pattern recognition data 156 to provide further pattern recognition capabilities. Consider an example in conjunction with FIG. 6, where the video signal is received by video processing system 102 in the form of a digital video recorder (DVR) at a user's home. The video signal 110 is processed by the DVR to generate processed video signal 112 for storage on the internal hard drive of the device for later playback. Indexing data 115 is generated by the video processing system 102 in conjunction with the storage of the processed video signal. In addition, the DVR sends the indexing data 115 to the post processing module 160, which is either local to the DVR or implemented in a remote server that is accessed by the DVR via the Internet or other network connection.

The post processing module 160 analyzes the indexing data 115 optionally included with the processed video signal 112 on a non-real-time basis to generate the searchable index 162 and optionally additional indexing based on further pattern recognition. In this fashion, the post processing module 160 can optionally perform additional pattern recognition to identify celebrities or other persons in the shots, as well as specific buildings, text, products and venues, of a type and/or in regions of interest identified by the pattern recognition data 156. The searchable index data 162 is stored on either the remote server or on the DVR itself to allow a user, via a search feature of the DVR or server, to locate and access portions of the stored video recording that contain features, such as shots that include scoring, shots with particular people, shots with particular objects or venues, etc.

In another example in conjunction with FIG. 6, the video signal is received by video processing system 102 in the form of a codec implemented via a network server as part of the encoding performed to upload a video to a social media website such as YouTube or Facebook. The video signal 110 is processed by the codec to generate processed video signal 112 for storage on the server for later playback by the user or other users. Indexing data 115 is generated by the video processing system 102 in conjunction with the storage of the processed video signal. In addition, the server sends the indexing data 115 to a post processing module 160 that is either local to the server or implemented in a remote server that is accessed by the server via the Internet or other network connection. The post processing module 160 analyzes the indexing data 115 optionally included with the processed video signal 112 on a non-real-time basis to generate the searchable index 162 and optionally additional indexing based on further pattern recognition. As in the prior example, the post processing module 160 can optionally perform additional pattern recognition to identify celebrities or other persons in the shots, as well as specific buildings, text, products and venues, of a type and/or in regions of interest identified by the pattern recognition data 156. The searchable index data 162 is stored on either the server or on a remote server to allow users, via a search feature, to locate and access portions of the stored video recording that contain features, such as shots that include scoring, shots with particular people, shots with particular objects or venues, etc.

FIG. 8 presents a tabular representation of a searchable index 162 in accordance with a further embodiment of the present invention. In another example in conjunction with FIGS. 6 & 7, a searchable index 162 is presented in tabular form where crowd shots and game shots are placed in an index structure under common labels. The range of images corresponding to each of the shots is indicated by a corresponding address range that can be used to quickly locate a particular shot or set of shots within a video.
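
The Python sketch below mirrors this structure in spirit: shots are grouped under common labels, with each entry holding its address range so a shot can be located quickly. The labels and address values are illustrative only.

    from collections import defaultdict

    def build_index(shots):
        """shots: list of (label, start_address, end_address) tuples.
        Returns a label -> list of address ranges mapping."""
        index = defaultdict(list)
        for label, start, end in shots:
            index[label].append((start, end))
        return index

    index = build_index([("commentator", 0x0000, 0x1FFF),
                         ("game",        0x2000, 0x5FFF),
                         ("crowd",       0x6000, 0x7FFF),
                         ("game",        0x8000, 0xBFFF)])
    print(index["game"])  # both game shots, each with its address range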

FIG. 9 presents a block diagram representation of a video processing system 102 in accordance with an embodiment of the present invention. In particular, a specific embodiment of video processing system 102 is shown where pattern recognition module 125′ operates in a similar fashion to pattern recognition module 125, but in conjunction with encoder section 236 of video codec 103. In this embodiment, coding feedback 300 takes the form of encoder feedback 296, and pattern recognition feedback data 298 is used by encoder section 236 in the encoding of video signal 110 into processed video signal 112. The encoder section 236 generates the encoder feedback data 296 in conjunction with the encoding of the video signal 110.

In operation, the pattern recognition module 125′ detects a pattern of interest in the image sequence 310 of video signal 110, based on encoder feedback data 296. The image sequence 310 can be extracted directly from video signal 110 or received via the encoder section 236 in conjunction with the processing of video signal 110 as presented in conjunction with FIG. 3. The pattern recognition module 125′ generates the pattern recognition data 156 to indicate the pattern of interest. As previously discussed, the pattern of interest could be a face, text, human action, or a wide range of other features or objects.

In this embodiment, the encoder section 236 generates the processed video signal 112 based on pattern recognition feedback 298 that includes the pattern recognition data 156. In particular, the encoder section 236 guides the encoding of the video signal 110 based on pattern recognition feedback 298 that indicates the pattern of interest was detected. For example, the pattern recognition feedback 298 can include region identification data that identifies a region of interest and the encoder section 236 can guide the encoding of the video signal 110 based on the region identification data.

As previously discussed, the encoder feedback data 296 includes shot transition data, such as shot transition data 152, that identifies temporal segments in the image sequence 310 corresponding to a plurality of video shots. The pattern recognition module 125′ can generate the pattern recognition data 156 corresponding to at least one of the plurality of video shots. The shots can include a plurality of images of the image sequence 310, and the pattern recognition module 125′ can generate the pattern recognition data 156 based on a temporal recognition performed over the plurality of images. Pattern recognition module 125′ can further generate indexing data 115, in the form of pattern recognition data 156 that includes an identification of the pattern of interest, together with shot transition data 152 derived from encoder feedback data 296 or other data that includes an identification of at least one corresponding shot of the plurality of video shots that includes the pattern of interest.

FIG. 10 presents a block diagram representation of a pattern recognition module 125′ in accordance with a further embodiment of the present invention. Pattern recognition module 125′ operates in a similar fashion to pattern recognition module 125 presented in conjunction with FIG. 4. The encoder feedback data 296 can include the same quantities described in conjunction with coding feedback data 300. In this embodiment, shot segmentation module 150′ operates in a similar fashion as shot segmentation module 150 to segment the image sequence 310, either directly or as extracted from the video signal 110. The pattern detection module 175 analyzes the shot data 154 generated by the shot segmentation module 150′ and generates pattern recognition data 156 that identifies at least one pattern of interest in conjunction with at least one of the plurality of shots.

Like coding feedback data 300, encoder feedback 296 can be generated by video encoder section 236 in conjunction with an encoding of the video signal 110 or a transcoding of the video signal 110. The video encoder section 236 can generate the shot transition data 152 based on image statistics, group of picture data, etc. As discussed above, encoding preprocessing information, like variance and downscaled motion cost, can be used to generate shot transition data 152 for shot segmentation. Based on their historical tracks, if variance and downscaled motion cost change dramatically, an abrupt shot transition happens; when variances keep changing monotonically and motion costs jump up and down at the start and end points of the monotonic variance changes, there is a gradual shot transition, like a fade-in, fade-out, dissolve, or wipe. The shot transition data 152 can not only be used for GOP structure decisions, mode selection and rate control to improve quality and performance in encoding by video encoder section 236, but can also be output as a portion of encoder feedback data 296 for temporal segmentation of the image sequence 310 and as an enabler for frame-rate-invariant shot-level search features.

Further encoder feedback 296 can also be used by pattern detection module 175. The encoder feedback data can include one or more image statistics, and the pattern detection module 175 can generate the pattern recognition data 156 based on these image statistics to identify features such as faces, text and human actions, as well as other objects and features. As discussed in conjunction with FIG. 1, temporal and spatial information used by video codec 103 to remove redundancy can also be used by pattern detection module 175 to detect or recognize features like sky, grass, sea, wall, buildings, moving vehicles and animals (including people). Temporal feedback in the form of motion vectors estimated in encoding or retrieved in decoding (or motion information obtained by optical flow for very low resolutions) can be used by pattern detection module 175 for motion-based pattern partition or recognition via a variety of moving group algorithms. Spatial information, such as statistical information like variance, frequency components and bit consumption estimated from input YUV or retrieved from input streams, can be used for texture-based pattern partition and recognition by a variety of different classifiers. Further recognition features, like structure, texture, color and motion characteristics, can be used for precise pattern partition and recognition. For instance, line structures can be used to identify and characterize manmade objects such as buildings and vehicles. Random motion, rigid motion and relative position motion are effective to discriminate water, vehicles and animals, respectively.

As previously discussed, in addition to analysis of static images included in the shot data 154, the shot data 154 can include a plurality of images in the image sequence 310, and the pattern detection module 175 can generate the pattern recognition data 156 based on a temporal recognition performed over a plurality of images within a shot. Slight motion within a shot and aggregation of images over a plurality of shots can enhance the resolution of the images for pattern analysis and can provide three-dimensional data from differing perspectives for the analysis and recognition of three-dimensional objects; other motion can aid in recognizing objects and other features based on the motion that is detected.

FIG. 11 presents a block diagram representation of a pattern detection module 175 or 175′ in accordance with a further embodiment of the present invention. In particular, pattern detection module 175 or 175′ includes a candidate region detection module 320 for detecting a detected region 322 in at least one image of image sequence 310. In operation, the candidate region detection module 320 can detect the presence of a particular pattern or other region of interest to be recognized as a particular region type. An example of such a pattern is a human face or other face, human action, text, or another object or feature. Pattern detection module 175 or 175′ optionally includes a region cleaning module 324 that generates a clean region 326 based on the detected region 322, such as via a morphological operation. Pattern detection module 175 or 175′ further includes a region growing module 328 that expands the clean region 326 to generate a region identification signal 330 that identifies the region containing the pattern of interest. The identified region type data 332 and the region identification data can be output as pattern recognition feedback data 298.
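
A minimal Python sketch of the clean-and-grow stages follows, using a morphological opening for region cleaning and a dilation for region growing; the iteration counts are assumptions, and SciPy stands in for whatever implementation the modules actually use.

    import numpy as np
    from scipy import ndimage

    def clean_and_grow(detected, clean_iters=1, grow_iters=4):
        """detected: HxW boolean mask of candidate pixels (detected
        region 322). Returns the grown region identification mask."""
        # Region cleaning module 324: opening removes speckle and yields
        # a more contiguous clean region 326.
        clean = ndimage.binary_opening(detected, iterations=clean_iters)
        # Region growing module 328: dilation expands the clean region so
        # surrounding parts of the object (e.g. hair around a face) are
        # included in region identification signal 330.
        grown = ndimage.binary_dilation(clean, iterations=grow_iters)
        return grown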

Considering, for example, the case where the shot data 154 includes a human face and the pattern detection module 175 or 175′ generates a region corresponding to the human face, candidate region detection module 320 can generate detected region 322 based on the detection of pixel color values corresponding to facial features such as skin tones. The region cleaning module 324 can generate a more contiguous region that contains these facial features, and the region growing module 328 can grow this region to include the surrounding hair and other image portions to ensure that the entire face is included in the region identified by region identification signal 330.

As previously discussed, the encoder feedback data 296 includes shot transition data, such as shot transition data 152, that identifies temporal segments in the image sequence 310 that are used to bound the shot data 154 to a particular set of images in the image sequence 310. The candidate region detection module 320 further operates based on motion vector data to track the position of a candidate region through the images in the shot data 154. Motion vectors, shot transition data and other encoder feedback data 296 are also made available to region tracking and accumulation module 334 and region recognition module 350. The region tracking and accumulation module 334 provides accumulated region data 336 that includes a temporal accumulation of the candidate regions of interest to enable temporal recognition via region recognition module 350. In this fashion, region recognition module 350 can generate pattern recognition data based on such features as facial motion, human actions, three-dimensional modeling and other features recognized and extracted based on such temporal recognition.

FIG. 12 presents a pictorial representation of an image 370 in accordance with a further embodiment of the present invention. In particular, an example image of image sequence 310 is shown that includes a portion of a particular football stadium as part of a video broadcast of a football game. In accordance with this example, pattern detection module 175 or 175′ generates region type data 332, included in both pattern recognition feedback data 298 and pattern recognition data 156, that indicates that text is present, and region identification data 330 that indicates the region 372 that contains the text in this particular image. The region recognition module 350 operates based on this region 372, and optionally based on other accumulated regions that include this text, to generate further pattern recognition data 156 that includes the recognized text string, “Lucas Oil Stadium”.
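
As an illustration of this text-recognition step, the sketch below crops the identified region and passes it to an off-the-shelf OCR engine. The patent does not specify a recognizer; pytesseract is used here purely as a stand-in, and the function name is hypothetical.

```python
# Illustrative only: pytesseract substitutes for an unspecified recognizer.
import pytesseract
from PIL import Image

def recognize_text_region(image: Image.Image, bbox) -> str:
    """Crop the region identified by region identification data 330
    (e.g., region 372) and run OCR on it."""
    x0, y0, x1, y1 = bbox
    crop = image.crop((x0, y0, x1, y1))
    # On image 370 this would ideally return "Lucas Oil Stadium".
    return pytesseract.image_to_string(crop).strip()
```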

FIG. 13 presents a block diagram representation of a supplemental pattern recognition module 360 in accordance with an embodiment of the present invention. While the embodiment of FIG. 12 is described based on recognition of the text string “Lucas Oil Stadium” via the operation of region recognition module 350, in another embodiment, the pattern recognition data 156 generated by pattern detection module 175 could merely include pattern descriptors, region types and region data for off-line recognition into feature/object recognition data 362 via the supplemental recognition module 360. In an embodiment, the supplemental recognition module 360 implements one or more pattern recognition algorithms. While described above in conjunction with the example of FIG. 12, the supplemental recognition module 360 can be used in conjunction with any of the other examples previously described to recognize a face, a particular person, a human action, or other features/objects indicated by pattern recognition data 156. In effect, the functionality of region recognition module 350 is included in the supplemental recognition module 360, rather than in pattern detection module 175 or 175′.

The supplemental recognition module 360 can be implemented using a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, a co-processor, a micro-controller, digital signal processor, microcomputer, central processing unit, field programmable gate array, programmable logic device, state machine, logic circuitry, analog circuitry, digital circuitry, and/or any device that manipulates signals (analog and/or digital) based on operational instructions that are stored in a memory. Such a memory may be a single memory device or a plurality of memory devices. Such a memory device can include a hard disk drive or other disk drive, read-only memory, random access memory, volatile memory, non-volatile memory, static memory, dynamic memory, flash memory, cache memory, and/or any device that stores digital information. Note that when the supplemental recognition module 360 implements one or more of its functions via a state machine, analog circuitry, digital circuitry, and/or logic circuitry, the memory storing the corresponding operational instructions may be embedded within, or external to, the circuitry comprising the state machine, analog circuitry, digital circuitry, and/or logic circuitry.

FIG. 14 presents a temporal block diagram representation of shot data 154 in accordance with a further embodiment of the present invention. In particular, various shots of shot data 154 are shown in conjunction with the video broadcast of a football game described in conjunction with FIG. 12. The first shot shown is a stadium shot that includes the image 370. The indexing data corresponding to this shot includes an identification of the shot as a stadium shot as well as the text string “Lucas Oil Stadium”. The other indexing data indicates the second and fourth shots as being shots of the game and the third shot as being a shot of the crowd.

As previously discussed, the indexing data generated in this fashion could be used to generate a searchable index of this video, along with other videos, as part of a video search system. A user of the video processing system 102 could search videos for “Lucas Oil Stadium” and not only identify the particular video broadcast, but also identify the particular shot or shots within the video, such as the shot containing image 370, that contain a text region, such as text region 372, from which the matching text string “Lucas Oil Stadium” was recognized.

FIG. 15 presents a block diagram representation of a candidate region detection module 320 in accordance with a further embodiment of the present invention. In this embodiment, candidate region detection module 320 operates via detection of colors in an image of image sequence 310. Color bias correction module 340 generates a color bias corrected image 342 from the image. Color space transformation module 344 generates a color transformed image 346 from the color bias corrected image 342. Color detection module 348 generates the detected region 322 from the colors of the color transformed image 346.

For instance, continuing with the example discussed in conjunction with FIG. 3 where human faces are detected, color detection module 348 can operate to detect colors in the color transformed image 346 that correspond to skin tones using an elliptic skin model in the transformed space, such as a CbCr subspace of a transformed YCbCr space. In particular, a parametric ellipse corresponding to contours of constant Mahalanobis distance can be constructed under the assumption of a Gaussian skin tone distribution to identify a detected region 322 based on a two-dimensional projection in the CbCr subspace. As exemplars, the 853,571 pixels corresponding to skin patches from the Heinrich-Hertz-Institute image database can be used for this purpose; however, other exemplars can likewise be used within the broader scope of the present invention.
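
The elliptic skin model can be sketched as follows, assuming numpy and OpenCV. The mean vector and covariance matrix below are illustrative placeholders rather than values trained on the Heinrich-Hertz-Institute exemplars; thresholding the Mahalanobis distance at a constant traces the parametric ellipse in the CbCr plane.

```python
# A minimal sketch; SKIN_MEAN and the covariance are assumed placeholders.
import numpy as np
import cv2

SKIN_MEAN = np.array([120.0, 155.0])                 # assumed (Cb, Cr) mean
SKIN_COV_INV = np.linalg.inv(np.array([[80.0, 15.0],
                                       [15.0, 60.0]]))  # assumed covariance

def skin_mask(bgr_image: np.ndarray, max_distance: float = 2.5) -> np.ndarray:
    """Return a binary mask of pixels inside the skin-tone ellipse."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    # OpenCV orders channels Y, Cr, Cb; select (Cb, Cr) for the subspace.
    cbcr = ycrcb[..., [2, 1]]
    diff = cbcr - SKIN_MEAN
    # Mahalanobis distance; a constant threshold traces a parametric ellipse.
    d2 = np.einsum('...i,ij,...j->...', diff, SKIN_COV_INV, diff)
    return (d2 <= max_distance ** 2).astype(np.uint8) * 255
```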

FIG. 16 presents a pictorial representation of an image 380 in accordance with a further embodiment of the present invention. In particular, an example image of image sequence 310 is shown that includes a player punting a football as part of a video broadcast of a football game. In accordance with this example, pattern detection module 175 or 175′ generates region type data 332, included in both pattern recognition feedback data 298 and pattern recognition data 156, that indicates that human action is present, and region identification data 330 that indicates the region 382 that contains the human action in this particular image. The region recognition module 350 or supplemental pattern recognition module 360 operates based on this region 382, and based on other accumulated regions that contain the punt, to generate further pattern recognition data 156 that includes human action descriptors such as “football player”, “kick”, “punt” or other descriptors that characterize this particular human action.

FIGS. 17-19 present pictorial representations of images 390, 392 and 394 in accordance with a further embodiment of the present invention. In particular, example images of image sequence 310 are shown that follow a punted football as part of a video broadcast of a football game. In accordance with this example, pattern detection module 175 or 175′ generates region type data 332, included in both pattern recognition feedback data 298 and pattern recognition data 156, that indicates that an object such as a football is present, and region identification data 330 that indicates the regions 391, 393 and 395 that contain the football in corresponding images 390, 392 and 394.

The region recognition module 350 or supplemental pattern recognition module 360 operates based on the accumulated regions 391, 393 and 395 that contain the football during the punt to generate further pattern recognition data 156 that includes human action descriptors such as “football play”, “kick”, “punt”, information regarding the distance, height and trajectory of the ball, and/or other descriptors that characterize this particular action.

It should be noted that, while the descriptions of FIGS. 9-19 have focused on an encoder section 236 that generates encoder feedback data 296 and that guides encoding based on pattern recognition feedback data 298, similar techniques could likewise be used in conjunction with a decoder section 240, or with transcoding performed by video codec 103, to generate coding feedback data 300 that is used by pattern recognition module 125 to generate pattern recognition feedback data that is used by the video codec 103 or decoder section 240 to guide the decoding or transcoding of the image sequence.

FIG. 20 presents a block diagram representation of a video distribution system 75 in accordance with an embodiment of the present invention. In particular, a video signal 50 is encoded by a video encoding system 52 into encoded video signal 60 for transmission via a transmission path 122 to a video decoder 62. Video decoder 62, in turn, can operate to decode the encoded video signal 60 for display on a display device such as television 10, computer 20 or another display device. The video processing system 102 can be implemented as part of the video encoding system 52 or the video decoder 62 to generate pattern recognition data 156 and/or indexing data 115 from the content of video signal 50.

The transmission path 122 can include a wireless path that operates in accordance with a wireless local area network protocol such as an 802.11 protocol, a WIMAX protocol, a Bluetooth protocol, etc. Further, the transmission path can include a wired path that operates in accordance with a wired protocol such as a Universal Serial Bus protocol, an Ethernet protocol or other high speed protocol.

FIG. 21 presents a block diagram representation of a video storage system 79 in accordance with an embodiment of the present invention. In particular, device 11 is a set top box with built-in digital video recorder functionality, a stand-alone digital video recorder, a DVD recorder/player or another device that records or otherwise stores a digital video signal for display on a video display device such as television 12. The video processing system 102 can be implemented in device 11 as part of the encoding, decoding or transcoding of the stored video signal to generate pattern recognition data 156 and/or indexing data 115.

While these particular devices are illustrated, video storage system 79 can include a hard drive, flash memory device, computer, DVD burner, or any other device that is capable of generating, storing, encoding, decoding, transcoding and/or displaying a video signal in accordance with the methods and systems described in conjunction with the features and functions of the present invention as described herein.

FIG. 22 presents a block diagram representation of a video server 80 in accordance with an embodiment of the present invention. A video system such as video server 80, implemented as a network server, web server or other network node, includes a video processing system 102 that generates a searchable index 162 in conjunction with the storage and/or transmission of the multiple video files or streams in video library 82. The video server 80 includes an interface, such as a web interface implemented in conjunction with a user's browser. Users of the video server 80 can supply search terms 398 to identify videos, and particular shots within the video content, that include celebrities or other persons, specific buildings, text of interest, products, venues, particular human actions or other objects/features of interest. In this case, the search module 86 compares the search terms 398 to the searchable index to locate one or more matching video signals from the video library 82 that match the search terms 398. One or more of the matching video signals from the video library 82 can be selected by the user, based on the results of the search, for streaming or download as the video signal 84.

For example, the video server 80 or other video system employs the video processing system 102 to generate a plurality of text strings that describe the videos of the video library 82 in conjunction with the encoding, decoding and/or transcoding of these videos. A memory 88, coupled to the video processing system 102, stores a searchable index 162 that includes the plurality of text strings. The search module 86 identifies matching videos from the video library 82 by comparing the search terms 398 or other input text strings to the plurality of text strings of the searchable index 162. Because the video processing system 102 generates the plurality of text strings to correspond to particular shots of the videos of video library 82, the search module 86 can further identify matching shots in the matching videos that contain the images that correspond to the search terms 398. In this fashion, a user can use search terms to search for particular people, faces, text, human actions or other recognized objects, events, places or other things in the video library 82, and not only identify particular videos of the video library 82 that correspond to these search terms, but also be directed to the particular shot or shots in these matching videos that contain the recognized person, face, text, human action or other recognized object, event, place or thing specified via the search terms 398.
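
A minimal sketch of this text-string search, assuming the searchable index 162 can be reduced to a mapping from recognized terms to (video identifier, shot number) pairs; the class and method names are illustrative.

```python
# Illustrative data structure; the disclosed index may differ in layout.
from collections import defaultdict

class SearchableIndex:
    def __init__(self):
        self._index = defaultdict(list)   # term -> [(video_id, shot_no)]

    def add(self, text_string: str, video_id: str, shot_no: int):
        """Index a recognized text string against a particular shot."""
        for term in text_string.lower().split():
            self._index[term].append((video_id, shot_no))

    def search(self, search_terms: str):
        """Return (video_id, shot_no) pairs matching every search term,
        so a query like 'Lucas Oil Stadium' yields both the video and
        the particular shot(s), e.g., the shot containing image 370."""
        hits = [set(self._index[t]) for t in search_terms.lower().split()]
        return set.intersection(*hits) if hits else set()
```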

In addition to searching based on text-based search terms or other descriptors, the video server 80 also presents the option for users to search the video library 82 based on a video clip, such as search clip 399. The search clip 399 may differ in resolution, bit-rate and frame-rate from the corresponding video in video library 82. The searchable index 162 can contain resolution-invariant and bit-rate-invariant frame-level search features that can be correlated on a shot-by-shot basis to a shot or shots contained in search clip 399 in order to determine a level of correlation or match between the search clip 399 and one or more videos of the video library 82.

In an example of operation, the video processing system 102 generates the searchable index by processing the video signals in the video library. The video processing system 102 generates hierarchical search features including frame-level temporal and spatial information such as normalized variance (e.g., the variance matrix of a single picture, and changes in the variance matrix and trends throughout multiple pictures of a video), motion density and changes in motion density throughout a video, color information and changes in color information throughout a video, main part motions, bit consumption, etc. The hierarchical search features can further include shot-level features such as shot transition temporal intervals, shot motion, statistical features and developing features relating to shot segmentation. The search features are stored in the searchable index 162. In response to a search request, the video processing system 102 generates indexing data 115 in the form of similar hierarchical search features in conjunction with an input video clip 399 during frame-level or macroblock-level decoding, encoding or transcoding. The search module 86 generates one or more matching videos from the video library 82 by matching the hierarchical search features from the video clip 399 to one or more corresponding features contained in the searchable index 162. Correlation can be compared on a shot-by-shot basis.

In this way, video clips can be processed and searched based on their own search features in multiple layers of a search hierarchy. For instance, shot-level information can be used to preliminarily locate shots of potential interest, and frame-level information can be used to further match potential shots corresponding to videos in the video library 82 to corresponding shots of the video clip 399. The search module locates a potential shot in the searchable index based on a comparison of at least one shot-level search feature of the video shot to a corresponding search feature in the searchable index. The search module then identifies the at least one matching video by comparing the frame-level features to the corresponding features of the potential shot. The search module 86 can optionally allow a clip consisting of discontinuous shots of a video to be matched, based on this shot-level and frame-level decomposition of hierarchical search features. While described in the example above in terms of a two-level hierarchy with frame and shot levels, other hierarchies can likewise be employed, with alternative levels or with one or more levels different from or in addition to those described above. A sketch of this two-level matching follows.
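
One way to realize this two-level matching, under the assumption that shot-level and frame-level features are plain numeric vectors and that normalized correlation is an adequate similarity measure, is sketched below; the thresholds, feature layout and function names are illustrative.

```python
# A sketch under assumed feature layout; thresholds are illustrative.
import numpy as np

def correlate(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized correlation between two feature vectors."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def search_clip(clip_shots, index_shots, shot_thresh=0.8, frame_thresh=0.9):
    """clip_shots / index_shots: lists of dicts with 'video_id',
    'shot_feat' (1-D array) and 'frame_feats' (2-D array, one row per
    frame). Returns (video_id, score) pairs, best first."""
    matches = []
    for q in clip_shots:
        for cand in index_shots:
            # Level 1: shot-level features locate potential shots.
            if correlate(q['shot_feat'], cand['shot_feat']) < shot_thresh:
                continue
            # Level 2: frame-level features confirm the potential shot.
            n = min(len(q['frame_feats']), len(cand['frame_feats']))
            score = np.mean([correlate(q['frame_feats'][i],
                                       cand['frame_feats'][i])
                             for i in range(n)])
            if score >= frame_thresh:
                matches.append((cand['video_id'], float(score)))
    return sorted(matches, key=lambda m: -m[1])
```

The per-match score here also illustrates where a matching confidence level, as described below, could come from.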

In an embodiment, a matching confidence level between the hierarchical search features of the video clip 399 and the corresponding features of the searchable index 162 can be generated for each of the search results. This confidence level can be used to select search results for output, to rank search results so as to indicate the most likely matches to the user, and to report the level of confidence associated with each matching video. The search module 86 can be implemented using a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, a co-processor, a micro-controller, digital signal processor, microcomputer, central processing unit, field programmable gate array, programmable logic device, state machine, logic circuitry, analog circuitry, digital circuitry, and/or any device that manipulates signals (analog and/or digital) based on operational instructions that are stored in a memory. The memory may be a single memory device or a plurality of memory devices. Such a memory device can include a hard disk drive or other disk drive, read-only memory, random access memory, volatile memory, non-volatile memory, static memory, dynamic memory, flash memory, cache memory, and/or any device that stores digital information. Note that when the search module 86 implements one or more of its functions via a state machine, analog circuitry, digital circuitry, and/or logic circuitry, the memory storing the corresponding operational instructions may be embedded within, or external to, the circuitry comprising the state machine, analog circuitry, digital circuitry, and/or logic circuitry. The video library 82 and the searchable index 162 can be stored in a memory such as a hard disk drive or other disk drive, read-only memory, random access memory, volatile memory, non-volatile memory, static memory, dynamic memory, flash memory, cache memory, and/or any device that stores digital information.

FIG. 23 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-22. In step 400, a processed video signal is generated via a video codec along with coding feedback data that includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots that each include a plurality of images in the image sequence. In step 402, the shot transition data is processed to segment the video signal into shot data corresponding to the plurality of shots, based on the coding feedback data. In step 404, the shot data is analyzed to generate pattern recognition data that identifies at least one pattern of interest in conjunction with at least one of the plurality of shots.
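
A minimal sketch of steps 402 and 404, assuming the shot transition data arrives as a list of frame indices at which new shots begin and that a pattern detector callable is available; both assumptions are illustrative.

```python
# Illustrative helpers; the disclosed method does not prescribe this layout.
def segment_into_shots(image_sequence, shot_transitions):
    """Step 402: split the image sequence into per-shot data using shot
    transition data (assumed: frame indices where new shots begin)."""
    bounds = list(shot_transitions) + [len(image_sequence)]
    return [image_sequence[a:b] for a, b in zip([0] + bounds, bounds)]

def analyze_shots(shots, pattern_detector):
    """Step 404: run an (assumed) pattern detector per shot, yielding
    pattern recognition data keyed by shot index."""
    return {i: pattern_detector(shot) for i, shot in enumerate(shots)}
```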

The coding feedback data can be generated in conjunction with at least one of: a decoding of the video signal, an encoding of the video signal and a transcoding of the video signal. The shot transition data can be generated based on at least one image statistic or based on group of picture data. The coding feedback data can include at least one image statistic, and the pattern recognition data can be generated based on the at least one image statistic. At least one of the plurality of shots includes a plurality of images in the image sequence, and the pattern recognition data can be generated based on a temporal recognition performed over the plurality of images.

FIG. 24 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-23. In step 410, a searchable index is generated from the index data by correlating common content from the plurality of shots.

FIG. 25 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-24. In step 420, encoder feedback data is generated in conjunction with the encoding of the image sequence via an encoder section. In step 422, a pattern of interest is detected in the image sequence, based on the encoder feedback data. In step 424, pattern recognition data is generated when the pattern of interest is detected, wherein the pattern recognition data indicates the pattern of interest.

In an embodiment, the processed video signal is generated based on pattern recognition feedback that includes the pattern recognition data. The pattern recognition data can include an identification of a pattern of interest, and the encoding of the image sequence can be guided based on pattern recognition feedback that indicates the identification of the pattern of interest. The pattern recognition feedback can further include region identification data that identifies a region of interest, and the encoding of the image sequence can be modified based on the region identification data.

The encoder feedback data can include shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots. The pattern recognition data can be generated to correspond to at least one of the plurality of video shots. At least one of the plurality of shots can include a plurality of images in the image sequence, and the pattern recognition data can be generated based on a temporal recognition performed over the plurality of images.

FIG. 26 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-25. In step 430, indexing data is generated that includes an identification of the pattern of interest and an identification of at least one corresponding shot of the plurality of video shots that includes the pattern of interest.

FIG. 27 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-26. In step 440, encoder feedback data is generated in conjunction with the encoding of the image sequence via an encoder section. In step 442, a face is detected in the image sequence, based on the encoder feedback data. In step 444, pattern recognition data is generated when the face is detected, wherein the pattern recognition data indicates presence of the face.

In an embodiment, the encoder feedback data includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots. At least one of the plurality of shots can include a plurality of images in the image sequence, and the pattern recognition data can be generated based on a temporal recognition performed over the plurality of images. Temporal recognition can track a candidate facial region over the plurality of images and detect a facial region based on an identification of facial motion in the candidate facial region over the plurality of images, wherein the facial motion includes at least one of: eye movement and mouth movement. The pattern recognition data can include pattern recognition feedback that indicates the location of the facial region, and the encoding of the image sequence can be guided based on the location of the facial region. The pattern recognition data can include pattern recognition feedback that further indicates the location of at least one of: eyes in the facial region and a mouth in the facial region. Temporal recognition can track a candidate facial region over the plurality of images and extract three-dimensional features based on different facial perspectives included in the plurality of images. The encoder feedback data can include at least one image statistic and/or motion vector data.
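
The facial-motion test described above might be sketched as follows, assuming equally sized grayscale crops of the tracked candidate region and a rough facial layout placing the eyes in an upper band and the mouth in a lower band of the region; the geometry and threshold are illustrative guesses, not disclosed values.

```python
# A sketch under assumed inputs; band fractions and threshold are guesses.
import numpy as np

def band(diff, fy0, fy1):
    """Slice a horizontal band (fractions of height) from a region."""
    h = diff.shape[0]
    return diff[int(fy0 * h):int(fy1 * h)]

def facial_motion_detected(crops, motion_thresh=8.0):
    """crops: equally sized grayscale images of the tracked candidate
    facial region, one per image of the shot."""
    eye_activity = mouth_activity = 0.0
    for prev, cur in zip(crops, crops[1:]):
        diff = np.abs(cur.astype(float) - prev.astype(float))
        # Assumed layout: eyes in the upper band, mouth in the lower band.
        eye_activity = max(eye_activity, band(diff, 0.20, 0.45).mean())
        mouth_activity = max(mouth_activity, band(diff, 0.60, 0.90).mean())
    # Accept the candidate as a facial region only if eye movement or
    # mouth movement is observed over the plurality of images.
    return eye_activity > motion_thresh or mouth_activity > motion_thresh
```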

FIG. 28 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-27. In step 450, encoder feedback data is generated in conjunction with the encoding of the image sequence via an encoder section. In step 452, a text region is detected in the image sequence, based on the encoder feedback data. In step 454, pattern recognition data is generated when the text region is detected, wherein the pattern recognition data indicates presence of the text region and the text string.

In an embodiment, the pattern recognition data includes a location of the region of text, and the encoding of the image sequence is modified based on pattern recognition feedback that indicates the presence of text in at least one image and indicates the location of the region of text. The encoder feedback data can include at least one image statistic and/or motion vector data.

FIG. 29 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-28, wherein the encoder feedback data includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots. In step 460, index data is generated that correlates the region of text to at least one of the plurality of video shots.

FIG. 30 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-29. In step 470, a text string is generated by recognizing text in the region of text. In step 472, index data is generated that includes the text string and correlates the text string to at least one of the plurality of video shots.

FIG. 31 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-30. In step 480, encoder feedback data is generated in conjunction with the encoding of the image sequence via an encoder section. In step 482, a region of human action is detected in the image sequence, based on the encoder feedback data. In step 484, pattern recognition data is generated when the region of human action is detected, wherein the pattern recognition data indicates presence of the region of human action.

In an embodiment, the pattern recognition data includes a location of the region of human action, and the video encoder guides the encoding of the image sequence based on pattern recognition feedback that indicates the presence of human action in at least one image and indicates the location of the region of human action. The encoder feedback data can include motion vector data.

FIG. 32 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-31, wherein the encoder feedback data includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots that each include a plurality of images in the image sequence. In step 490, index data is generated that correlates the region and descriptor of human action to at least one of the plurality of video shots.

FIG. 33 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-32. In step 500, at least one human action descriptor is generated by recognizing human action in the region of human action. The index data can include the at least one human action descriptor correlated to at least one of the plurality of video shots. These human action descriptors can be generated based on a temporal recognition performed over the plurality of images.

FIG. 34 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-33. The human action descriptors are generated by: identifying a plurality of moving objects in the plurality of images in step 510, discriminating at least one human from the plurality of moving objects in step 512, tracking the action of the human in the plurality of images in step 514, and recognizing a human action in the plurality of images as shown in step 516.
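
A sketch of these four steps using OpenCV primitives as stand-ins; background subtraction approximates step 510, and the is_human and classify_action placeholders stand in for whatever discriminator and recognizer the system actually employs.

```python
# A sketch with stand-in components; discriminator/recognizer are placeholders.
import cv2

def human_action_descriptors(frames):
    """Steps 510-516 over the images of a shot (BGR frames)."""
    backsub = cv2.createBackgroundSubtractorMOG2()
    trajectory = []
    for frame in frames:
        # Step 510: identify moving objects via background subtraction.
        mask = backsub.apply(frame)
        mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadows
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Step 512: discriminate at least one human from the moving objects.
        humans = [c for c in contours if is_human(c)]
        if humans:
            # Step 514: track the action of the human across the images.
            x, y, w, h = cv2.boundingRect(max(humans, key=cv2.contourArea))
            trajectory.append((x + w / 2.0, y + h / 2.0))
    # Step 516: recognize a human action from the tracked motion.
    return classify_action(trajectory)

def is_human(contour):
    """Placeholder discriminator: large, taller-than-wide blobs."""
    _, _, w, h = cv2.boundingRect(contour)
    return cv2.contourArea(contour) > 500 and h > w

def classify_action(trajectory):
    """Placeholder recognizer mapping gross motion to a descriptor."""
    if len(trajectory) < 2:
        return []
    dx = abs(trajectory[-1][0] - trajectory[0][0])
    return ["run"] if dx > 50 else ["stand"]
```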

FIG. 35 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-34. In step 520, a searchable index is stored that includes search features corresponding to videos contained in a video library. In step 522, a video signal is decoded and search features of the video signal are generated in conjunction with the decoding. In step 524, at least one matching video of the video library is identified by comparing the search features of the video signal to corresponding search features of the searchable index.

The search features can include at least one shot-level search feature and at least one frame-level feature. Step 522 can include segmenting an image sequence of the video signal into shot data corresponding to a plurality of shots, and generating the shot-level search feature based on the shot data. The search features can include hierarchical search features. Step 524 can include locating a potential shot in the searchable index based on a comparison of at least one shot-level search feature of the video signal to a corresponding search feature in the searchable index. Step 524 can further include comparing the at least one frame-level feature to the corresponding feature of the potential shot.

FIG. 36 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-35. In step 530, the searchable index is generated based on a processing of the videos of the video library.

FIG. 37 presents a flowchart representation of a method in accordance with an embodiment of the present invention. In particular, a method is presented for use in conjunction with one or more functions and features described in conjunction with FIGS. 1-36. In step 540, a matching confidence level is generated corresponding to the at least one matching video.

As may be used herein, the terms “substantially” and “approximately” provide an industry-accepted tolerance for their corresponding terms and/or relativity between items. Such an industry-accepted tolerance ranges from less than one percent to fifty percent and corresponds to, but is not limited to, component values, integrated circuit process variations, temperature variations, rise and fall times, and/or thermal noise. Such relativity between items ranges from a difference of a few percent to magnitude differences. As may also be used herein, the term(s) “operably coupled to”, “coupled to”, and/or “coupling” includes direct coupling between items and/or indirect coupling between items via an intervening item (e.g., an item includes, but is not limited to, a component, an element, a circuit, and/or a module) where, for indirect coupling, the intervening item does not modify the information of a signal but may adjust its current level, voltage level, and/or power level. As may further be used herein, inferred coupling (i.e., where one element is coupled to another element by inference) includes direct and indirect coupling between two items in the same manner as “coupled to”. As may even further be used herein, the term “operable to” or “operably coupled to” indicates that an item includes one or more of power connections, input(s), output(s), etc., to perform, when activated, one or more of its corresponding functions and may further include inferred coupling to one or more other items. As may still further be used herein, the term “associated with” includes direct and/or indirect coupling of separate items and/or one item being embedded within another item. As may be used herein, the term “compares favorably” indicates that a comparison between two or more items, signals, etc., provides a desired relationship. For example, when the desired relationship is that signal 1 has a greater magnitude than signal 2, a favorable comparison may be achieved when the magnitude of signal 1 is greater than that of signal 2 or when the magnitude of signal 2 is less than that of signal 1.

As may also be used herein, the terms “processing module”, “processing circuit”, and/or “processing unit” may be a single processing device or a plurality of processing devices. Such a processing device may be a microprocessor, micro-controller, digital signal processor, microcomputer, central processing unit, field programmable gate array, programmable logic device, state machine, logic circuitry, analog circuitry, digital circuitry, and/or any device that manipulates signals (analog and/or digital) based on hard coding of the circuitry and/or operational instructions. The processing module, module, processing circuit, and/or processing unit may be, or further include, memory and/or an integrated memory element, which may be a single memory device, a plurality of memory devices, and/or embedded circuitry of another processing module, module, processing circuit, and/or processing unit. Such a memory device may be a read-only memory, random access memory, volatile memory, non-volatile memory, static memory, dynamic memory, flash memory, cache memory, and/or any device that stores digital information. Note that if the processing module, module, processing circuit, and/or processing unit includes more than one processing device, the processing devices may be centrally located (e.g., directly coupled together via a wired and/or wireless bus structure) or may be distributedly located (e.g., cloud computing via indirect coupling via a local area network and/or a wide area network). Further note that if the processing module, module, processing circuit, and/or processing unit implements one or more of its functions via a state machine, analog circuitry, digital circuitry, and/or logic circuitry, the memory and/or memory element storing the corresponding operational instructions may be embedded within, or external to, the circuitry comprising the state machine, analog circuitry, digital circuitry, and/or logic circuitry. Still further note that the memory element may store, and the processing module, module, processing circuit, and/or processing unit executes, hard coded and/or operational instructions corresponding to at least some of the steps and/or functions illustrated in one or more of the Figures. Such a memory device or memory element can be included in an article of manufacture.

The present invention has been described above with the aid of method steps illustrating the performance of specified functions and relationships thereof. The boundaries and sequence of these functional building blocks and method steps have been arbitrarily defined herein for convenience of description. Alternate boundaries and sequences can be defined so long as the specified functions and relationships are appropriately performed. Any such alternate boundaries or sequences are thus within the scope and spirit of the claimed invention. Further, the boundaries of these functional building blocks have been arbitrarily defined for convenience of description. Alternate boundaries could be defined as long as the certain significant functions are appropriately performed. Similarly, flow diagram blocks may also have been arbitrarily defined herein to illustrate certain significant functionality. To the extent used, the flow diagram block boundaries and sequence could have been defined otherwise and still perform the certain significant functionality. Such alternate definitions of both functional building blocks and flow diagram blocks and sequences are thus within the scope and spirit of the claimed invention. One of average skill in the art will also recognize that the functional building blocks, and other illustrative blocks, modules and components herein, can be implemented as illustrated or by discrete components, application specific integrated circuits, processors executing appropriate software and the like or any combination thereof.

The present invention may have also been described, at least in part, in terms of one or more embodiments. An embodiment of the present invention is used herein to illustrate the present invention, an aspect thereof, a feature thereof, a concept thereof, and/or an example thereof. A physical embodiment of an apparatus, an article of manufacture, a machine, and/or of a process that embodies the present invention may include one or more of the aspects, features, concepts, examples, etc. described with reference to one or more of the embodiments discussed herein. Further, from figure to figure, the embodiments may incorporate the same or similarly named functions, steps, modules, etc. that may use the same or different reference numbers and, as such, the functions, steps, modules, etc. may be the same or similar functions, steps, modules, etc. or different ones.

Unless specifically stated to the contrary, signals to, from, and/or between elements in a figure of any of the figures presented herein may be analog or digital, continuous time or discrete time, and single-ended or differential. For instance, if a signal path is shown as a single-ended path, it also represents a differential signal path. Similarly, if a signal path is shown as a differential path, it also represents a single-ended signal path. While one or more particular architectures are described herein, other architectures can likewise be implemented that use one or more data buses not expressly shown, direct connectivity between elements, and/or indirect coupling between other elements as recognized by one of average skill in the art.

The term “module” is used in the description of the various embodiments of the present invention. A module includes a processing module, a functional block, hardware, and/or software stored on memory for performing one or more functions as may be described herein. Note that, if the module is implemented via hardware, the hardware may operate independently and/or in conjunction with software and/or firmware. As used herein, a module may contain one or more sub-modules, each of which may be one or more modules.

While particular combinations of various functions and features of the present invention have been expressly described herein, other combinations of these features and functions are likewise possible. The present invention is not limited by the particular examples disclosed herein and expressly incorporates these other combinations.

What is claimed is:
1. A system for processing a video signal into a processed video signal, the video signal including an image sequence, the system comprising: a pattern recognition module for detecting a region of human action in the image sequence based on coding feedback data and generating pattern recognition data in response thereto; and a video codec, coupled to the pattern recognition module, that generates the processed video signal by processing the image sequence and by generating the coding feedback data in conjunction with the processing of the image sequence.
2. The system of claim 1 wherein the pattern recognition data includes a location of the region of human action and wherein the encoder section guides the encoding of the image sequence based on pattern recognition feedback that indicates the presence of human action in at least one image and indicates the location of the region of human action.
3. The system of claim 1 wherein the coding feedback data includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots that each include a plurality of images in the image sequence.
4. The system of claim 3 wherein the pattern recognition module further generates index data that correlates the region of human action to at least one of the plurality of video shots.
5. The system of claim 3 wherein the pattern recognition module generates at least one human action descriptor by recognizing the human action and further generates index data that includes the at least one human action descriptor correlated to at least one of the plurality of video shots.
6. The system of claim 5 wherein the pattern recognition module generates the at least one human action descriptor based on a temporal recognition performed over the plurality of images.
7. The system of claim 6 wherein the pattern recognition module generates the at least one human action descriptor by: identifying a plurality of moving objects in the plurality of images; discriminating at least one human from the plurality of moving objects; tracking the action of the human in the plurality of images; and recognizing a human action in the plurality of images.
8. The system of claim 1 wherein the coding feedback data includes motion vector data.
9. A method for encoding a video signal into a processed video signal, the video signal including an image sequence, the method comprising: generating encoder feedback data in conjunction with the encoding of the image sequence via an encoder section; detecting a region of human action in the image sequence, based on the encoder feedback data; and generating pattern recognition data when the region of human action is detected, wherein the pattern recognition data indicates presence of the region of human action.
10. The method of claim 9 wherein the pattern recognition data includes a location of the region of human action and wherein the video encoder guides the encoding of the image sequence based on pattern recognition feedback that indicates the presence of human action in at least one image and indicates the location of the region of human action.
11. The method of claim 9 wherein the encoder feedback data includes shot transition data that identifies temporal segments in the image sequence corresponding to a plurality of video shots that each include a plurality of images in the image sequence.
12. The method of claim 11 further comprising: generating index data that correlates the region of human action to at least one of the plurality of video shots.
13. The method of claim 11 further comprising: generating at least one human action descriptor by recognizing human action in the region of human action; wherein the index data includes the at least one human action descriptor correlated to at least one of the plurality of video shots.
14. The method of claim 13 wherein the at least one human action descriptor is generated based on a temporal recognition performed over the plurality of images.
15. The method of claim 14 wherein the at least one human action descriptor is generated by: identifying a plurality of moving objects in the plurality of images; discriminating at least one human from the plurality of moving objects; tracking the action of the human in the plurality of images; and recognizing a human action in the plurality of images.
16. The method of claim 9 wherein the encoder feedback data includes motion vector data.